Before beginning this assignment, please ensure you have access to R and RStudio; this can be on your own personal computer or on the IMT 573 RStudio Server.
Download the problemset3.Rmd file from Canvas or save a copy to your local directory on RStudio Server. Open problemset3.Rmd in RStudio and supply your solutions to the assignment by editing problemset3.Rmd.
Replace the “Insert Your Name Here” text in the author: field with your own full name. Any collaborators must be listed on the top of your assignment.
Be sure to include well-documented (e.g. commented) code chunks, figures, and clearly written text chunk explanations as necessary. Any figures should be clearly labeled and appropriately referenced within the text. Be sure that each visualization adds value to your written explanation; avoid redundancy – you do not need four different visualizations of the same pattern.
Collaboration on problem sets is fun and useful, and we encourage it, but each student must turn in an individual write-up in their own words as well as code/work that is their own. Regardless of whether you work with others, what you turn in must be your own work; this includes code and interpretation of results. The names of all collaborators must be listed on each assignment. Do not copy-and-paste from other students’ responses or code.
All materials and resources that you use (with the exception of lecture slides) must be appropriately referenced within your assignment.
Remember, partial credit will be awarded for each question for which a serious attempt at finding an answer has been shown. Students are encouraged to attempt each question and to document their reasoning process even if they cannot find the correct answer. If you would like to include R code to show this process, but it does not run without errors, you can do so with the eval=FALSE option as follows:
a + b # these objects don't exist
# if you run this on its own it will give an error
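For reference, such a chunk might look like the following in the .Rmd source (the object names are placeholders; eval=FALSE tells knitr to display the code without running it):

````markdown
```{r, eval=FALSE}
a + b # these objects don't exist, but knitting still succeeds
```
````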
Knit PDF, rename the knitted PDF file to ps3_YourLastName_YourFirstName.pdf, and submit the PDF file on Canvas.
In this problem set you will need, at minimum, the following R packages.
# Load standard libraries
library('dplyr')
library('censusr')
library('stringr')
library("ggplot2")
In this problem set, we will be joining disparate sets of data - namely: Seattle police crime data, information on Seattle police beats, and education attainment from the US Census. Our ultimate goal is to build a dataset where we can examine questions around crimes in Seattle and the educational attainment of people living in the areas in which the crime occurred.
As a general rule, be sure to keep copies of the original dataset(s) as you work through cleaning (remember data provenance).
Load the Seattle crime data (crime_data.csv). You can find more information on the data here: (https://data.seattle.gov/Public-Safety/Crime-Data/4fs7-3vj5). This dataset is constantly refreshed online so we will be using the csv file for consistency. We will henceforth call this dataset the “Crime Dataset.” Perform a basic inspection of the Crime Dataset and discuss what you find.
crimes<-read.csv("crime_data.csv.bz2")
summary(crimes)
## Report.Number Occurred.Date Occurred.Time
## Min. :2.008e+08 01/20/2016: 186 Min. : 0
## 1st Qu.:2.008e+13 12/01/2015: 184 1st Qu.: 904
## Median :2.011e+13 11/25/2015: 182 Median :1500
## Mean :1.644e+13 09/23/2014: 181 Mean :1364
## 3rd Qu.:2.015e+13 11/20/2009: 180 3rd Qu.:1925
## Max. :2.017e+13 01/01/2008: 177 Max. :2359
## (Other) :447731 NA's :3
## Reported.Date Reported.Time Crime.Subcategory
## 12/31/2008: 238 Min. : 0 CAR PROWL :127036
## 03/31/2014: 196 1st Qu.: 950 THEFT-ALL OTHER : 45702
## 05/12/2014: 193 Median :1410 BURGLARY-RESIDENTIAL: 42044
## 07/05/2016: 193 Mean :1355 THEFT-SHOPLIFT : 41948
## 01/21/2014: 192 3rd Qu.:1820 MOTOR VEHICLE THEFT : 38364
## 05/19/2014: 187 Max. :2359 BURGLARY-COMMERCIAL : 19844
## (Other) :447622 NA's :3 (Other) :133883
## Primary.Offense.Description Precinct Sector
## THEFT-CARPROWL :112134 : 6 M : 36926
## THEFT-SHOPLIFT : 41948 EAST : 66920 U : 34612
## THEFT-OTH : 39838 NORTH :143043 K : 32675
## VEH-THEFT-AUTO : 33491 SOUTH : 64605 B : 31606
## BURGLARY-FORCE-RES: 25353 SOUTHWEST: 42045 D : 30831
## THEFT-BUILDING : 18366 UNKNOWN : 2788 E : 30189
## (Other) :177691 WEST :129414 (Other):251982
## Beat Neighborhood
## K3 : 14383 DOWNTOWN COMMERCIAL: 42077
## U1 : 12854 NORTHGATE : 26487
## L2 : 12619 CAPITOL HILL : 26421
## M1 : 12477 QUEEN ANNE : 23309
## E2 : 12383 SLU/CASCADE : 20232
## M2 : 12356 UNIVERSITY : 17804
## (Other):371749 (Other) :292491
head(crimes)
sapply(crimes,class)
## Report.Number Occurred.Date
## "numeric" "factor"
## Occurred.Time Reported.Date
## "integer" "factor"
## Reported.Time Crime.Subcategory
## "integer" "factor"
## Primary.Offense.Description Precinct
## "factor" "factor"
## Sector Beat
## "factor" "factor"
## Neighborhood
## "factor"
First, we see that Report.Number is printed in scientific notation, and we need to change that:
# turn off scientific notation when printing large numbers
options(scipen = 999)
Then we see that Occurred.Date is a factor, and we need to convert it into a date variable:
crimes$Occurred.Date <- as.Date(crimes$Occurred.Date, "%m/%d/%Y")
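As a quick sanity check (a sketch; the exact range depends on the data), we can confirm the conversion produced Date objects and inspect their span:

```r
# the column should now be of class "Date"
class(crimes$Occurred.Date)
# earliest and latest non-missing occurred dates
range(crimes$Occurred.Date, na.rm = TRUE)
```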
Let’s start by looking at the years in which crimes were committed. What is the earliest year in the dataset? Are there any distinct trends with the annual number of crimes committed in the dataset?
#arranging the dataset according to date to find the earliest one
crimes %>% arrange(Occurred.Date)
We see that the earliest occurred date that is not NA is in 1908.
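An alternative to arranging the whole dataset, using the same crimes object, is to ask for the minimum date directly:

```r
# earliest non-missing occurred date, without sorting all rows
min(crimes$Occurred.Date, na.rm = TRUE)
```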
#Examining annual trends
#creating a year column and grouping by year
crimes<-crimes %>% mutate(year = format(Occurred.Date,"%Y"))
crimes %>% group_by(year) %>% tally() %>%
  ggplot(aes(year, n)) +
  geom_bar(stat = 'identity', fill = 'blue') +
  labs(subtitle = "Annual number of recorded crimes",
       y = "Total number of crimes",
       x = "Year",
       title = "Crimes per year")
We roughly see from this graph that the number of crimes is steady until 2007, then increases sharply, and is again steady through 2017.

Let’s subset the data to only include crimes that were committed after 2011 (remember good practices of data provenance!). Going forward, we will use this data subset.
crimes2011 <- crimes %>% filter(year > 2011)
crimes2011
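Note that year was created with format(), so it is a character column; filter(year > 2011) works here because R coerces 2011 to "2011" and four-digit year strings sort correctly, but an explicit conversion is safer. A sketch of the same subset with a numeric year:

```r
# explicit numeric year avoids relying on string comparison
crimes2011 <- crimes %>%
  mutate(year = as.numeric(year)) %>%
  filter(year > 2011)
```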
How frequently are the beats in the Crime Dataset listed? Are there any anomalies with how frequently some of the beats are listed? Are there missing beats?
Note: beat is a sort of police district, the area a police officer patrols, roughly corresponding to neighborhoods. See Wikipedia https://en.wikipedia.org/wiki/Beat_(police). See also the map at https://www.seattle.gov/police/information-and-data/tweets-by-beat.
#counting how frequently beats are listed
table(crimes2011$Beat)
##
## B1 B2 B3 C1 C2 C3 CS CTY D1 D2 D3 DET E1 E2
## 1509 6048 6807 6821 4482 3830 3817 0 1 6354 5941 5189 5 5824 8408
## E3 F1 F2 F3 G1 G2 G3 H1 INV J1 J2 J3 K1 K2 K3
## 5611 3416 5135 4143 2540 4326 3355 0 0 4516 5099 5441 5256 5125 9056
## L1 L2 L3 LAPT M1 M2 M3 N N1 N2 N3 O1 O2 O3 Q1
## 4624 8136 4472 0 7813 8328 7624 0 4249 5937 5916 3208 2163 2443 4302
## Q2 Q3 R1 R2 R3 S S1 S2 S3 SS U1 U2 U3 W W1
## 6338 7217 4876 5724 5758 4 4003 4252 4834 1 8023 7001 6933 3 4011
## W2 W3 WS X9
## 4924 4147 1 0
beat_tally<-crimes2011 %>% group_by(Beat) %>% tally()
beat_tally
beat_tally %>%
  ggplot(aes(Beat, n)) +
  geom_bar(stat = 'identity', fill = 'blue') +
  labs(subtitle = "Number of recorded crimes per beat",
       y = "Total number of crimes",
       x = "Beat",
       title = "Crimes by beat")
We see from the table and plot above that, of the 275320 observations, 1509 have a blank (unlisted) beat. We also see entries such as LAPT, INV, CTY, CS, and WS that do not follow the standard beat naming format, and several beats have a frequency of zero, which does not make sense.
Load the data on Seattle police beats (police_beat_and_precinct_centerpoints.csv). You can find additional information on the data here: (https://data.seattle.gov/Land-Base/Police-Beat-and-Precinct-Centerpoints/4khs-fz35). We will henceforth call this dataset the “Beats Dataset.”
#loading the data
beat<-read.csv("police_beat_and_precinct_centerpoints.csv")
beat
#renaming a column for joining purposes
names(beat)[1]<-"Beat"
beat
We see that there are 57 beats in this dataset with their locations.
Does the Crime Dataset include police beats that are not present in the Beats Dataset? If so, how many and with what frequency do they occur? Would you say that these comprise a large number of the observations in the Crime Dataset or are they rather infrequent? Do you think removing them would drastically alter the scope of the Crime Dataset?
# checking if crime dataset has beats not present in beats dataset
notpresentbeats <- anti_join(crimes2011, beat, by = "Beat")
## Warning: Column `Beat` joining factors with different levels, coercing to
## character vector
table(notpresentbeats$Beat)
##
## B1 B2 B3 C1 C2 C3 CS CTY D1 D2 D3 DET E1 E2
## 1509 0 0 0 0 0 0 0 1 0 0 0 5 0 0
## E3 F1 F2 F3 G1 G2 G3 H1 INV J1 J2 J3 K1 K2 K3
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## L1 L2 L3 LAPT M1 M2 M3 N N1 N2 N3 O1 O2 O3 Q1
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Q2 Q3 R1 R2 R3 S S1 S2 S3 SS U1 U2 U3 W W1
## 0 0 0 0 0 4 0 0 0 1 0 0 0 0 0
## W2 W3 WS X9
## 0 0 1 0
# We see the 1509 blank instances, plus a few instances of CTY, DET (5 instances), S, SS, and WS that are
# missing from the Beats Dataset. With almost 1500 such values in total, they occur often enough that they
# need to be removed.
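To quantify this relative to the whole subset, a self-contained sketch (missing_beats is a name chosen here for illustration):

```r
# crime records whose beat is absent from the Beats Dataset
missing_beats <- anti_join(crimes2011, beat, by = "Beat")
nrow(missing_beats)                      # absolute count
nrow(missing_beats) / nrow(crimes2011)   # share of all observations
```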
Let’s remove all instances in the Crime Dataset that have beats which occur fewer than 10 times across the Crime Dataset. Also remove any observations with missing beats. After only keeping years of interest and filtering based on frequency of the beat, how many observations do we now have in the Crime Dataset?
#removing blank values
compcrimes2011<- crimes2011[!(is.na(crimes2011$Beat) | crimes2011$Beat==""),]
#removing beats that occur less than 10 times
beatcount<-compcrimes2011 %>% group_by(Beat) %>% count() %>% filter(n>10)
#frequency of beats
table(beatcount$Beat)
##
## B1 B2 B3 C1 C2 C3 CS CTY D1 D2 D3 DET E1 E2
## 0 1 1 1 1 1 1 0 0 1 1 1 0 1 1
## E3 F1 F2 F3 G1 G2 G3 H1 INV J1 J2 J3 K1 K2 K3
## 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1
## L1 L2 L3 LAPT M1 M2 M3 N N1 N2 N3 O1 O2 O3 Q1
## 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1
## Q2 Q3 R1 R2 R3 S S1 S2 S3 SS U1 U2 U3 W W1
## 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1
## W2 W3 WS X9
## 1 1 0 0
compcrimes2011<- compcrimes2011[compcrimes2011$Beat %in% beatcount$Beat,]
compcrimes2011
#to confirm if the same beats are present in the new dataset
table(compcrimes2011$Beat)
##
## B1 B2 B3 C1 C2 C3 CS CTY D1 D2 D3 DET E1 E2
## 0 6048 6807 6821 4482 3830 3817 0 0 6354 5941 5189 0 5824 8408
## E3 F1 F2 F3 G1 G2 G3 H1 INV J1 J2 J3 K1 K2 K3
## 5611 3416 5135 4143 2540 4326 3355 0 0 4516 5099 5441 5256 5125 9056
## L1 L2 L3 LAPT M1 M2 M3 N N1 N2 N3 O1 O2 O3 Q1
## 4624 8136 4472 0 7813 8328 7624 0 4249 5937 5916 3208 2163 2443 4302
## Q2 Q3 R1 R2 R3 S S1 S2 S3 SS U1 U2 U3 W W1
## 6338 7217 4876 5724 5758 0 4003 4252 4834 0 8023 7001 6933 0 4011
## W2 W3 WS X9
## 4924 4147 0 0
# We have 273796 observations after keeping only the years of interest and filtering on beat frequency.
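The zero counts in the table above appear because Beat is a factor, and removed beats survive as empty factor levels. A sketch to drop them so they no longer show up in tables and plots:

```r
# drop unused factor levels left over from the removed beats
compcrimes2011$Beat <- droplevels(compcrimes2011$Beat)
table(compcrimes2011$Beat)
```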
To join the Beat Dataset to census data, we must have census tract information.
First, let’s remove the beats in the Beats Dataset that are not listed in the (cleaned) Crime Dataset.
Then, let’s use the censusr package to extract the 15-digit census tract for each police beat using the corresponding latitude and longitude. Do this using each of the police beats listed in the Beats Dataset. We recommend not to use a for-loop for this but instead rely on dedicated functions, in particular the ‘apply’ family of functions (these somewhat correspond to list comprehensions in Python). Add a column to the Beat Dataset that contains the 15-digit census tract for each beat. (HINT: you may find censusr’s call_geolocator_latlon function useful)
# removing the beats in the Beats Dataset that are not listed in the (cleaned) Crime Dataset.
beatfinal <- beat %>% filter((Beat %in% compcrimes2011$Beat))
beatfinal
#using censusr library
library("censusr")
beatfinal$ctract <-apply(beatfinal, 1, function(row) call_geolocator_latlon(row["Latitude"], row["Longitude"]))
beatfinal
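An equivalent approach, assuming call_geolocator_latlon takes latitude then longitude as used above, is mapply over the two coordinate columns; unlike apply, this avoids converting the whole data frame to a character matrix first:

```r
# look up the 15-digit census block code for each beat centerpoint
beatfinal$ctract <- mapply(call_geolocator_latlon,
                           beatfinal$Latitude,
                           beatfinal$Longitude)
```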
We will eventually join the Beats Dataset to the Crime Dataset. We could have joined the two and then found the census tracts for each beat. Would there have been a particular advantage/disadvantage to doing this join first and then finding census tracts? If so, what is it? (NOTE: you do not need to write any code to answer this)
# Because of the size of the Crime Dataset, joining first would mean looking up the census tract for the
# same beat once per crime rather than once per beat, which would only increase the computation time.
# Geocoding the small Beats Dataset first makes more sense.
Once we have the 15-digit census codes, we will break down the code based on information of interest. You can find more information on what these 15 digits represent here: https://transition.fcc.gov/form477/Geo/more_about_census_blocks.pdf.
First, create a column that contains the state code for each beat in the Beats Dataset. Then create a column that contains the county code for each beat. Find the FIPS codes for WA State and King County (the county of Seattle) online. Are the extracted state and county codes what you would expect them to be? Why or why not?
beatfinal$stateCode <- substr(beatfinal$ctract,start = 1,stop = 2)
beatfinal$countyCode <- substr(beatfinal$ctract,start = 3,stop = 5)
beatfinal
The FIPS code for Washington State is 53, and the county code for King County is 033 (combined county FIPS: 53033). This matches the extracted state and county codes, which is what we would expect, since all of the beats are in Seattle.
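We can verify this programmatically (a sketch using the stateCode and countyCode columns created above):

```r
# every beat should geocode to Washington (53) / King County (033)
all(beatfinal$stateCode == "53")
all(beatfinal$countyCode == "033")
```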
The census data uses an 11-digit code that consists of the state, county, and tract code. It does not include the block code. To join the census data to the Beats Dataset, we must have this code for each of the beats. Extract the 11-digit code for each of the beats in the Beats Dataset. The 11 digits consist of the 2 state digits, 3 county digits, and 6 tract digits. Add a column with the 11-digit code for each beat.
beatfinal$GEO.id2 <- substr(beatfinal$ctract, start = 1, stop = 11)
# convert to numeric so the type matches GEO.id2 in the census data
beatfinal$GEO.id2 <- as.numeric(beatfinal$GEO.id2)
Now, we will examine census data (census_edu_data.csv). The data includes counts of education attainment across different census tracts. Note how this data is in a ‘wide’ format and how it can be converted to a ‘long’ format. For now, we will work with it as is.
The census data contains a “GEO.id” column. Among other things, this variable encodes the 11-digit code that we had extracted above for each of the police beats. Specifically, when we look at the characters after the characters “US” for values of GEO.id, we see encodings for state, county, and tract, which should align with the beats we had above. Extract the 11-digit code from the GEO.id column. Add a column to the census data with the 11-digit code for each census observation.
census <- read.csv("census_edu_data.csv.bz2")
census
# The census data already contains a column, GEO.id2, with the 11-digit extraction.
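If that column did not exist, we could extract it ourselves. Assuming GEO.id values look like "1400000US53033000100" (the 11 digits follow the literal "US"), a sketch using stringr, which is already loaded:

```r
# pull the 11 digits that follow "US" in GEO.id
extracted <- str_extract(as.character(census$GEO.id), "(?<=US)\\d{11}")
# cross-check against the provided GEO.id2 column
all(as.numeric(extracted) == census$GEO.id2)
```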
Join the census data with the Beat Dataset using the 11-digit codes as keys. Be sure that you do not lose any of the police beats when doing this join (i.e. your output dataframe should have the same number of rows as the cleaned Beats Dataset - use the correct join). Are there any police beats that do not have any associated census data? If so, how many?
beatcensus<-left_join(beatfinal,census,by ="GEO.id2")
beatcensus
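To answer whether any police beats lack associated census data, we can count rows where a census-only column came back NA from the left join (a sketch; GEO.id is assumed to exist only in the census data):

```r
# beats with no matching census row after the left join
sum(is.na(beatcensus$GEO.id))
```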
Now join the Crime Dataset to our joined beat/census data. We can do this using the police beat name. Again, be sure you do not lose any observations from the Crime Dataset. What is the final dimensions of the joined dataset?
beatcrimecensus<-left_join(compcrimes2011,beatcensus,by = "Beat")
## Warning: Column `Beat` joining factors with different levels, coercing to
## character vector
dim(beatcrimecensus)
## [1] 273796 46
We have one extra column, holding the total number of beats per group. Once everything is joined, save the final dataset for future use. We’ll revisit it in future problem sets!
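A sketch for saving the joined data (the file names here are arbitrary):

```r
# save as RDS to preserve column types, and as CSV for portability
saveRDS(beatcrimecensus, "beat_crime_census.rds")
write.csv(beatcrimecensus, "beat_crime_census.csv", row.names = FALSE)
```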